Skip to content

fix(snell): drain pending data before pool reuse#2830

Open
missuo wants to merge 1 commit into
MetaCubeX:Alphafrom
missuo:fix-snell-reuse-pool
Open

fix(snell): drain pending data before pool reuse#2830
missuo wants to merge 1 commit into
MetaCubeX:Alphafrom
missuo:fix-snell-reuse-pool

Conversation

@missuo

@missuo missuo commented May 24, 2026

Copy link
Copy Markdown

Summary

The reuse-mode snell client pool surfaces stale server data on every reuse, so against a real Surge snell-server (v5.0.1 observed) only the first session per pooled TCP succeeds and every subsequent one fails with EOF.

Three related changes:

  1. Drain server data before pool reuse (transport/snell/pool.go). When the local side closes first (typical for short HTTP/1 responses), the snell server may still have its reply tail + its own zero-chunk half-close queued in our TCP receive buffer. The previous `PoolConn.Close` just sent our zero chunk and put the conn back; the next session's first read then saw that leftover data. The new `drainPendingForReuse` reads frames until `shadowaead.ErrZeroChunk` is observed, with a 500 ms deadline. If drain times out or errors, the conn is closed instead of pooled.

  2. Cap each pooled TCP at 2 CONNECT sessions (`defaultMaxUsesPerConn`). Surge's snell-server closes a reuse-mode TCP after the second session, so without a cap the pool would hand the next caller a soon-to-be-dead socket. The cap is tracked per-conn via a new internal `pooledEntry` wrapper (no public API change beyond removing the never-called `Pool.Put`).

  3. Retry once on stale pool conn (`adapter/outbound/snell.go`). Pooled conns can also die between put and re-get (e.g. server idle timeout). `DialContext` now retries the `pool.GetContext` + `writeHeaderContext` pair once — the second attempt either yields another idle conn or falls through to the pool factory and dials fresh.

Reproduction

Tested against a real Surge snell-server v5.0.1 with 6 sequential HTTP GETs through a `reuse=true` outbound (forced fresh `proxy.DialContext` per request via `DisableKeepAlives: true`):

Before this change:
```
[1] OK
[2] FAIL ... EOF
[3] FAIL ... EOF
[4] FAIL ... EOF
[5] FAIL ... EOF
[6] FAIL ... EOF
5/6 requests failed
```

After this change:
```
[1] OK dur=272ms
[2] OK dur=365ms
[3] OK dur=245ms
[4] OK dur=278ms
[5] OK dur=292ms
[6] OK dur=300ms
all 6 requests succeeded
```

The alternating timing matches the `maxUsesPerConn=2` cap: fresh dial (~290 ms) followed by pool hit (~250 ms).

Test plan

  • `go build ./...`
  • `go vet ./transport/snell/... ./adapter/outbound/...`
  • `go test ./transport/snell/...` — updated 2 existing tests for the new `uses` field and pool entry wrapper; added 3 new tests (`TestPoolConnCloseDrainsAndReturnsToPool`, `TestPoolConnCloseDiscardsWhenDrainFails`, `TestPoolConnCloseDiscardsAtMaxUsesPerConn`, `TestPoolGetContextIncrementsUses`)
  • End-to-end against Surge snell-server v5.0.1 — see reproduction above

When the local side of a relay closes first (typical for short HTTP/1
responses), the snell server may still have data queued in our TCP
receive buffer — most importantly its own zero-chunk half-close.
Returning the conn to the pool without draining surfaces that stale
data on the next session's first read, breaking every reuse.

Also cap each pooled TCP at 2 CONNECT sessions: Surge's snell-server
(v5.0.1 observed) closes a reuse-mode TCP after the second session,
so anything beyond the cap is a soon-to-be-dead socket. The outbound
DialContext now retries a stale pool conn once so a write failure
on a checked-out conn falls through to a fresh dial transparently.

Repro against a real Surge snell-server v5.0.1: 6 sequential HTTP
GETs through a reuse=true snell outbound. Before this change, 5/6
fail with EOF after the first session; after, 6/6 succeed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant